1 Introduction

For this project we selected the data from the “VII Encuesta de Presupuestos Familiares” (VII Household Budget Survey). This survey is conducted every 5 years in Chile on a representative sample of households, and it reveals the expenditure structure and consumption patterns of Chilean families and individuals. Its main use is the construction of a list of goods and services whose costs are monitored monthly to calculate the “IPC”, or consumer price index, which is the main index used to measure inflation in Chile. However, the survey is also an important tool that collects a significant amount of socioeconomic information about urban households and their inhabitants, including age, gender, housing tenure, income, education and working conditions.

The data used for this work is split into two data sets, households and expenses.

Some data wrangling is needed before starting the exploratory data analysis: the households data set has one entry for each household member, while the expenses data set records expenses per household only, not per inhabitant. This means that it is possible to merge the data by matching the household IDs, but not by individual (the data was intended to be used at the household level).
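
As a minimal sketch of that wrangling step, the per-person rows can be collapsed to one row per household and then merged with the expenses by household ID. The toy data frames and the column names hh.id, income and total.exp are hypothetical, chosen only for illustration; they are not the survey's actual variable names.

```r
# Toy stand-ins for the two data sets; every column name here is hypothetical.
households <- data.frame(
  hh.id  = c(1, 1, 1, 2, 2, 3),          # household ID, repeated per member
  age    = c(45, 43, 12, 67, 30, 29),
  income = c(900, 600, 0, 400, 1100, 1500)
)
expenses <- data.frame(
  hh.id     = c(1, 2, 3),                # one row per household
  total.exp = c(1200, 1300, 1000)
)

# Collapse the per-person rows to one row per household.
hh.income <- aggregate(income ~ hh.id, data = households, FUN = sum)
hh.size   <- aggregate(age ~ hh.id, data = households, FUN = length)
names(hh.size)[2] <- "num.members"

# Merge the household-level summaries with the expenses by household ID.
hh.level <- merge(hh.income, hh.size, by = "hh.id")
merged   <- merge(hh.level, expenses, by = "hh.id")
print(merged)
```

Aggregating first keeps the merge one-to-one, so household expenses are not repeated once per household member.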

2 Loading libraries and data set

The following libraries were used for this work:

The data is stored in RData files, after being transformed from SPSS data sets.

load("households.RData")
load("expenses.RData")

Some cleaning is still needed: for example, there are two negative ages, and some households have no total income reported because of missing data.

households <- subset(households, age >= 0 & !is.na(income.hh.av.rent))
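
To see what this filter does, here is a small sketch on invented values (only the column names age and income.hh.av.rent come from the code above). Note that subset also drops rows where the condition evaluates to NA, so a missing age would be removed as well.

```r
# Invented toy rows, not survey data.
toy <- data.frame(
  age               = c(34, -1, 7, NA, 52),
  income.hh.av.rent = c(1200, 800, NA, 950, 2100)
)

# Same filter as above: keep non-negative ages with a reported income.
cleaned <- subset(toy, age >= 0 & !is.na(income.hh.av.rent))

# Rows removed: the negative age, the missing income, and the missing age.
dropped <- nrow(toy) - nrow(cleaned)
print(cleaned)
print(dropped)
```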

3 Univariate Plots and Analysis.

Let’s start with some simple explorations of the population in our data set. From the variable descriptions we decided to focus on the following variables:

3.1 Household’s inhabitants Age Distribution

What is the population’s age distribution? The summary function can give us a start:

Age Distribution Summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 17 32 34.9 51 103

So, the median age of the population is 32 years, while the mean is 34.9. However, a plot gives us much more information about the age distribution of the population surveyed in our data set. The next figure shows a histogram of the age variable (a discrete numerical variable), with a bin width of 1 year. The plot shows that the ages are not normally distributed and that the distribution is positively skewed overall. This is expected, since the population must decrease with age as people die from accidents, illness or natural causes.

However, it is interesting to notice some peaks at around 5, 25 and 50 years of age: they might correspond to generations with higher birth rates or lower infant mortality.


3.2 Educational attainment

Educational attainment measures the educational level attained by an individual. In our data set the variable that measures this is called edu.level, a categorical variable stored as a factor with 16 levels plus an “NA” for missing information. The next figure shows a bar plot with the distribution of this variable for our population. A logarithmic scale was used on the Y axis to better display the differences, as some categories have a small number of cases.


3.3 Kinship

The variable kinship is a categorical variable stored as a factor vector with 14 levels indicating the kinship relation of each individual with respect to the household head. We present a bar plot to analyse the distribution of this variable. Notice that the Y axis scale is again logarithmic, so that we don’t lose information given the small number of cases for some relationships.

Most of the inhabitants of a household have a direct relation with the household head: they are either the children or the spouse in most cases.


3.4 Household Tenure

Household tenure describes the kind of tenure a household has over its main dwelling place. tph is a categorical variable, again stored as a factor vector, with 9 levels plus an “NA” category for missing information.

The following figure shows a bar plot with the percentage each category represents over the total number of households surveyed. Over 60% of the households are either fully owned or owned through a mortgage that is still being paid. I was expecting a higher percentage of households paying rent; however, only around 15% of the households fall in this category.
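
The percentages behind such a bar plot can be sketched with table and prop.table. The category labels and values below are invented, and the sketch assumes one row per household (the real households data set has one row per member, so it would need to be reduced to households first).

```r
# Invented tenure values, one per household; labels are illustrative only.
tph <- factor(c("owned", "owned", "mortgage", "rented", "owned",
                "mortgage", "rented", "lent", "owned", "mortgage"))

# Share of each category over the total number of households, in percent.
pct <- round(100 * prop.table(table(tph)), 1)

# Largest categories first, as they would appear in an ordered bar plot.
print(sort(pct, decreasing = TRUE))
```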


3.5 Household’s Income

We analyse here the monthly household income. A household’s income adds up all the incomes of the inhabitants of a household, including income from dependent work, independent work, rents, social assistance, financial instruments, pensions, etc. Our data set includes 10517 households. As shown in the following table, the summary of this variable tells us that the median income is US$1110 while the mean is US$1707, and 75% of the households earn less than US$1976.

Available Income by Household Summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.96 653.4 1110 1707 1976 53280

The next figure plots a histogram of the household’s income.


We can see that most of the data is under US$2000. For the next figure we change the X axis scale from linear to logarithmic, and we use geom_density instead of geom_histogram.
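
A minimal ggplot2 sketch of this kind of plot follows. It uses simulated log-normal incomes rather than the survey data, and the column name income and the axis labels are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated, right-skewed stand-in for household income (not survey data).
toy <- data.frame(income = rlnorm(1000, meanlog = 7, sdlog = 0.8))

# Density estimate drawn on a logarithmic X axis.
p <- ggplot(toy, aes(x = income)) +
  geom_density() +
  scale_x_log10() +
  labs(x = "Monthly household income (US$, log scale)", y = "Density")

print(p)
```

With scale_x_log10 the density is estimated on the log-transformed values, which is why a right-skewed variable can look almost symmetric in the resulting plot.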

The plot gives a better description of the household’s income. We can see the peak at around US$1100, which is the median of the income distribution. However, this plot can be misleading for some readers, because it could lead them to believe that wealth is well distributed among Chilean households.

So, we plot a histogram once again with linear scales on both axes, but this time we remove the 10% of households with the highest incomes. The next figure shows the result.


The reader may wonder about the 10 seemingly arbitrary ticks used on the X axis. These values represent the limits of the income deciles for Chilean household incomes. The following table illustrates these deciles.

Table with percentiles calculated to create Income Deciles
Percentile 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
US dollars 3 420 576 739 908 1110 1369 1714 2310 3610 53280

These deciles were calculated using the quantile function, looking for the limits that separate the households into 10 deciles, each one comprising 10% of the households. Using this new variable, which we called income.dec, we created the bar plot shown in the next figure. The plot shows how wealth is distributed among Chilean households: the top decile, the 10% of households with the highest incomes, receives around 35% of the total wealth.
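
One way to sketch this calculation is to combine quantile with cut and then add up the income within each decile. The incomes below are simulated, not survey values; only the names quantile and income.dec come from the text, and the use of cut and tapply is our assumption about how such a variable could be built.

```r
set.seed(42)
# Simulated, right-skewed incomes standing in for the survey variable.
income <- rlnorm(1000, meanlog = 7, sdlog = 0.8)

# Eleven break points (0%, 10%, ..., 100%) delimit the ten deciles.
breaks <- quantile(income, probs = seq(0, 1, by = 0.1))

# Assign each household to its decile; include.lowest keeps the minimum.
income.dec <- cut(income, breaks = breaks, include.lowest = TRUE, labels = 1:10)

# Percentage of the total income received by each decile.
share <- 100 * tapply(income, income.dec, sum) / sum(income)
print(round(share, 1))
```

Each decile holds the same number of households, so any difference between the bars reflects how unevenly the total is shared.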


3.6 Inhabitants by household

The variable num.inhabitants holds the number of inhabitants per household. The following figure shows a histogram of the number of inhabitants per household.

From the following summary table we can conclude that the median household size is 3 inhabitants, while the mean is 3.389.

Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3.389 4 15

3.7 Educational Institutions

In this section we study the variable edependence. This variable holds the kind of educational institution the students in our population are attending. It is categorical and is stored as a factor with 12 levels, including 1 level for the individuals not studying. In the next figure we show the distribution ordered by the number of cases per institution, removing the population that is not currently studying or attending any institution.


Regarding primary and secondary education, the next figure shows a bar plot with the percentage of students attending a public school, a private school or a state-subsidized institution.

3.8 Health Expenditure

The variable health.exp records how much the surveyed population spends monthly on health, mainly on health insurance, either social or private, which is mandatory for people in dependent or independent work.

Health Expenditure Summary Table
Statistic Value
Min. 2.402
1st Qu. 27.24
Median 44.36
Mean 80.98
3rd Qu. 91.5
Max. 2279

The previous table states that the median expenditure is US$44.36, with a mean of US$80.98 and a maximum of US$2279. How is the expenditure distributed? The following figure shows a histogram with this information.


As with other variables in previous sections, there are some outliers that prevent us from seeing the distribution in greater detail, so we create a new plot with a logarithmic X axis scale.

3.9 Univariate Analysis and reflection

All of the variables explored in this section come from the household data set. Some of the variables are per household and others per individual. The most interesting features were the ones analysed: population age, educational attainment, kinship, household tenure, household’s income, educational institutions and health expenditure.

From the exploration, we found that the Chilean population is young overall, with 50% of the population under 32 years old. From the number of inhabitants per household and the relationship of the dwellers to the household head, we can also conclude that households are mainly composed of families with parents and children living together.

One of the most important conclusions is that the wealth distribution in Chile seems to present high levels of inequality, with 50% of the households earning less than US$1110 monthly, and the highest decile earning more than US$3610 and concentrating around 35% of the wealth.

4 Bivariate/Multivariate Plot and Analysis.

4.1 Population Pyramid

We can refine the plot built in Household’s inhabitants Age Distribution and turn it into a population pyramid:

The peaks seem to change for each gender! We can also notice that there are more women (53.2%) than men (46.8%). Is the average age different between genders?

We see a difference in the average age of the two genders, with males having an overall younger population.

  Min. 1st Qu. Median Mean 3rd Qu. Max.
Men 1 16 30 33.49 50 101
Women 1 18 34 36.15 52 103

Let’s test whether the difference is statistically significant using the Wilcoxon rank sum test:

Wilcoxon rank sum test with continuity correction: age by gender
Test statistic P value Alternative hypothesis
143282477 2.947e-28 * * * two.sided

The test confirms that the difference in age between genders is significant (p < 0.05).
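
A test of this kind can be sketched with wilcox.test from base R. The ages below are simulated for two groups of 200 people each, so the statistic and p-value will not match the table above; the column names age and gender are assumed for illustration.

```r
set.seed(7)
# Simulated ages for two groups (not survey data).
toy <- data.frame(
  gender = rep(c("Men", "Women"), each = 200),
  age    = c(round(rgamma(200, shape = 3, scale = 11)),
             round(rgamma(200, shape = 3, scale = 12)))
)

# Two-sided Wilcoxon rank sum test of age by gender.
res <- wilcox.test(age ~ gender, data = toy, alternative = "two.sided")

print(res$method)
print(res$statistic)
print(res$p.value)

# Decision at the 5% level used in the text.
significant <- res$p.value < 0.05
```

The rank sum test does not assume normality, which suits an age distribution that is clearly skewed.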

4.2 Educational Attainment and Age

We now study the relation between educational attainment and age. The next figure explores this relation for two groups taken from the population that is currently not studying (as reported by the variable studying): one group includes everyone who is not studying (top bar plot); the other includes only those who are not studying and are over 30 years old. The variables used are:

  • studying: indicates whether the person is currently studying.

  • age: individual age, in years.

  • edu.level: educational attainment. Factor with 16 levels, plus an NA option.

The black vertical line separates the levels related to primary and secondary education (to the left) from tertiary or higher education attainment (to the right). Not much difference is seen between the two plots, but we wanted to be sure that we were not including individuals who have not actually finished their education, even though they are reported as not studying at the time of the survey.

4.3 Educational Attainment and Gender

We will study the relation between educational attainment and gender, using a stacked histogram.


Because the number of women in our sample is larger than the number of men, we will instead use the percentage of each gender for each level.
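
A sketch of this normalisation, assuming it means the share of each gender's own population that reaches each level, can be written with prop.table. The toy data and the level labels are invented; only the variable names gender and edu.level follow the text.

```r
# Invented toy sample with more women than men.
toy <- data.frame(
  gender    = c("Men", "Men", "Men", "Men",
                "Women", "Women", "Women", "Women", "Women", "Women"),
  edu.level = c("Primary", "Secondary", "Secondary", "Tertiary",
                "Primary", "Primary", "Secondary", "Secondary",
                "Secondary", "Tertiary")
)

# Raw counts: rows are education levels, columns are genders.
counts <- table(toy$edu.level, toy$gender)

# margin = 2 normalises within each gender, so every column sums to 100%.
pct <- 100 * prop.table(counts, margin = 2)
print(round(pct, 1))
```

Normalising within each gender makes the two groups comparable even though one is larger than the other.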

We don’t see major differences in educational attainment for different genders.

4.4 Wage Income and Age

4.5 Wage Income, Age and Gender.

4.6 Household’s Income and Expenses

4.7 Top Expenditures

What are households spending on?

4.8 Household Income Decile and Expenditures

D Code Description
01 Food and non-alcoholic beverages
02 Alcoholic beverages, tobacco and narcotics
03 Clothing and footwear
04 Housing, water, electricity, gas and other fuels
05 Furnishings, household equipment and routine household maintenance
06 Health
07 Transport
08 Communication
09 Recreation and culture
10 Education
11 Restaurants and hotels
12 Miscellaneous goods and services

5 Final Plots and Summary

5.1 Plot One

So, while the mean household income is US$1707 the median is at US$1110.

5.2 Plot Two

5.3 Plot Three

6 Reflection